Project: Investigate a Dataset - TMDb Movies

Table of Contents

Introduction

Dataset Description

About Dataset: This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

The columns include:
id: The unique identifier for each movie.
imdb_id: The unique identifier for each movie on the IMDb site.
popularity: Popularity for each movie.
budget: Movie budget
revenue: Movie revenue
original_title: Movie title
cast: Movie cast
homepage: Movie webpage link
director: Movie director
tagline: Movie tagline
keywords: Movie keywords
overview: Movie overview
runtime: Movie runtime
genres: Movie genres
production_companies: Movie production companies
release_date: Date of movie release
vote_count: Movie vote count
vote_average: Movie vote average
budget_adj: Movie budget in 2010 dollars, accounting for inflation over time.
revenue_adj: Movie revenue in 2010 dollars, accounting for inflation over time.

Question(s) for Analysis

1) Top directors with most movies
2) Top actors with most movies
3) Top actors and directors with high mean ratings
4) Top actors and directors with popularity ratings
5) Year with most movies released
6) Trendline of popularity and vote average
7) Which genres are most popular from year to year?
8) What kinds of properties are associated with movies that have high revenues?
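
The counting questions (1 and 2) can be approached with a small pandas sketch like the one below. It runs on invented toy rows rather than the real file, and it assumes that the cast column stores several names in one string separated by "|", which is not stated above.

```python
import pandas as pd

# Toy rows standing in for the real dataset (all values are invented).
movies = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "director": ["Dir One", "Dir Two", "Dir One"],
    "cast": ["Ann|Bob", "Bob|Cy", "Ann|Bob|Cy"],
})

# Question 1: number of movies per director, most frequent first.
director_counts = movies["director"].value_counts()

# Question 2: split the cast string (assumed "|"-separated) into a list,
# turn each list element into its own row, then count each actor.
actor_counts = movies["cast"].str.split("|").explode().value_counts()

print(director_counts.head(10))
print(actor_counts.head(10))
```

The same split-and-explode pattern would apply to any other multi-valued column, such as genres, if it follows the same format.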

Data Wrangling

Assessing the data

Data Cleaning

The assessment shows that some columns have null values.
In this section, we will check for duplicates and remove them, then decide how to handle the null values.
Some unnecessary columns will be dropped, and some derived columns will be created.
Columns with inappropriate data types will be converted.
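
A minimal sketch of the duplicate check, using a toy frame with one repeated row (the values are invented, and the variable names are illustrative):

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are invented).
movies = pd.DataFrame({
    "id": [1, 2, 2],
    "original_title": ["A", "B", "B"],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(movies.duplicated().sum())
print(f"Duplicate rows found: {n_duplicates}")

# Keep the first occurrence of each row and renumber the index.
deduped = movies.drop_duplicates().reset_index(drop=True)
```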

Creating derived columns and dropping unneeded columns

New columns like profit, actor, genre and production_company will be derived from existing columns.

Columns like id, imdb_id, budget, revenue, homepage, tagline, keywords, overview, genres and
production_companies will be dropped.
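
One way this step might look is sketched below on toy data. Two things here are assumptions, not taken from the text above: profit is computed as revenue_adj minus budget_adj, and each of actor, genre and production_company keeps only the first entry of a "|"-separated string.

```python
import pandas as pd

# Toy rows (values are invented); only the columns needed for the sketch.
movies = pd.DataFrame({
    "id": [1, 2],
    "cast": ["Ann|Bob", "Cy"],
    "genres": ["Action|Drama", "Comedy"],
    "production_companies": ["Studio X|Studio Y", "Studio Z"],
    "budget_adj": [100.0, 50.0],
    "revenue_adj": [250.0, 40.0],
})

# Assumption: profit is adjusted revenue minus adjusted budget.
movies["profit"] = movies["revenue_adj"] - movies["budget_adj"]

# Assumption: each derived column keeps the first "|"-separated entry.
derived_from = {
    "actor": "cast",
    "genre": "genres",
    "production_company": "production_companies",
}
for new_column, source_column in derived_from.items():
    movies[new_column] = movies[source_column].str.split("|").str[0]

# Drop the columns listed above; errors="ignore" skips names
# that are absent from this toy frame.
to_drop = ["id", "imdb_id", "budget", "revenue", "homepage", "tagline",
           "keywords", "overview", "genres", "production_companies"]
movies = movies.drop(columns=to_drop, errors="ignore")
```

Deriving the new columns before dropping matters, since genres and production_companies are both sources and drop targets.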

Dealing with null values

All numerical columns are free of null values; only the categorical (string) columns contain nulls.
These nulls will therefore be filled with the string 'unspecified'.
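
As a rough sketch on toy data, the fill can be limited to the columns that actually contain nulls, which leaves the numerical columns untouched:

```python
import pandas as pd

# Toy rows with nulls only in the string columns (values are invented).
movies = pd.DataFrame({
    "director": ["Dir One", None],
    "cast": [None, "Ann|Bob"],
    "runtime": [90, 120],
})

# Find the columns that contain at least one null value.
null_counts = movies.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].index

# Fill only those columns with the placeholder string.
movies[columns_with_nulls] = movies[columns_with_nulls].fillna("unspecified")
print(movies.isna().sum())
```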

Data Types

The data type of each column will be rechecked to ensure every column is stored in the right format.
Any column found with an inappropriate type will be converted to the correct one.
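
The text does not say which columns need converting. As an illustration only, the sketch below assumes release_date arrives as a string and converts it to a datetime; the date format used in the toy data is also an assumption and may differ from the real file.

```python
import pandas as pd

# Toy rows (values are invented); release_date starts out as plain strings.
movies = pd.DataFrame({
    "release_date": ["2015-06-09", "2010-12-25"],
    "budget_adj": [1.5e8, 2.0e7],
})

# Inspect the current types before changing anything.
print(movies.dtypes)

# Convert the string dates; an explicit format avoids ambiguous parsing.
movies["release_date"] = pd.to_datetime(movies["release_date"], format="%Y-%m-%d")

# Confirm the result.
print(movies.dtypes)
```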

Exploratory Data Analysis

NOTE:

Note that some revenue_adj and budget_adj entries contain a value of 0.
Each such entry gives an incorrect financial record for the movie, which in turn distorts the overall picture of the data.
To correct this, we remove every such row and check whether what remains is enough to give holistic insights.
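
A minimal sketch of this filter on toy data follows. It assumes a row is dropped when either budget_adj or revenue_adj is 0, which is one reading of the note above.

```python
import pandas as pd

# Toy rows (values are invented); three of the four have a 0 somewhere.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget_adj": [100.0, 0.0, 50.0, 0.0],
    "revenue_adj": [250.0, 80.0, 0.0, 0.0],
})

# Assumption: keep only rows where both financial columns are non-zero.
has_financials = (movies["budget_adj"] > 0) & (movies["revenue_adj"] > 0)
financial = movies[has_financials].copy()

# Report how much of the data was removed by the filter.
share_removed = 1 - len(financial) / len(movies)
print(f"Rows kept: {len(financial)} of {len(movies)} ({share_removed:.0%} removed)")
```

Keeping the filtered rows in a separate frame leaves the full dataset available for the questions that do not depend on budget or revenue.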

The dataset has been reduced by over 50%. It is most likely not ideal for insights, especially compared with the original dataset.
Still, we'll go ahead and use it for some economics-based insights, even though we don't trust its integrity.

Finding Relationships Between Variables

What kinds of properties are associated with movies that have high revenues?
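
One possible way to explore this question is to compare the average popularity and vote average of high-revenue movies against all movies. The sketch below uses toy data, and it treats "high revenue" as the top quartile of revenue_adj, which is an assumption made for illustration.

```python
import pandas as pd

# Toy rows (values are invented).
movies = pd.DataFrame({
    "revenue_adj": [10.0, 20.0, 30.0, 400.0],
    "popularity": [0.5, 0.7, 0.6, 3.0],
    "vote_average": [5.5, 6.0, 5.8, 7.5],
})

# Assumption: "high revenue" means at or above the 75th percentile.
threshold = movies["revenue_adj"].quantile(0.75)
high_revenue = movies[movies["revenue_adj"] >= threshold]

# Put the two sets of means side by side for comparison.
properties = ["popularity", "vote_average"]
comparison = pd.DataFrame({
    "all_movies": movies[properties].mean(),
    "high_revenue": high_revenue[properties].mean(),
})
print(comparison)
```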

Conclusions:

Actors, directors and production companies that make more movies with above-average popularity and ratings are more profitable. The popularity of movies has increased over the years. Profit and revenue have increased per year, mainly because more movies are released each year. The average quality of movies released has declined over the years. The average budget for movies has fallen in relative terms over the years, which could be related to the fact that more low-budget movies were released in more recent years.

Most movies with high revenues have vote averages and popularity ratings above the overall averages. Comedy, Action and Drama seem to be the most common genres and also the most profitable.